Input Formats

DeepChem supports a whole range of input files. For example, accepted input formats for deepchem include .csv, .sdf, .fasta, .png, .tif and other file formats. The loading for a particular file format is governed by Loader class associated with that format. For example, with a csv input, we use the CSVLoader class underneath the hood. Here's an example of a sample .csv file that fits the requirements of CSVLoader.

  1. A column containing SMILES strings [1].
  2. A column containing an experimental measurement.
  3. (Optional) A column containing a unique compound identifier.

Here's an example of a potential input file.

Compound ID measured log solubility in mols per litre smiles
benzothiazole -1.5 c2ccc1scnc1c2

Here the "smiles" column contains the SMILES string, the "measured log solubility in mols per litre" contains the experimental measurement and "Compound ID" contains the unique compound identifier.

[2] Anderson, Eric, Gilman D. Veith, and David Weininger. "SMILES, a line notation and computerized interpreter for chemical structures." US Environmental Protection Agency, Environmental Research Laboratory, 1987.

Data Featurization

Most machine learning algorithms require that input data form vectors. However, input data for drug-discovery datasets routinely come in the format of lists of molecules and associated experimental readouts. To
transform lists of molecules into vectors, we need to subclasses of DeepChem loader class dc.data.DataLoader such as dc.data.CSVLoader or dc.data.SDFLoader. Users can subclass dc.data.DataLoader to load arbitrary file formats. All loaders must be passed a dc.feat.Featurizer object. DeepChem provides a number of different subclasses of dc.feat.Featurizer for convenience.